---
title : "R Exercises - Religion and babies"
date  : "`r Sys.Date()`"
output: 
    html_document:
        toc            : true
        toc_float      : true
        toc_depth      : 4
        highlight      : tango
        number_sections: true
        code_download  : true
        code_folding   : show
params: 
    solution:
        value: true
---

----

In these exercises we will see the power of the `ggplot2` and `plotly` libraries to make sense of statistical data. The goal is to reproduce the animated chart shown in this video from Hans Rosling. I invite you to watch his other videos too; they are quite enlightening and inspiring:

<div style="max-width:854px"><div style="position:relative;height:0;padding-bottom:56.25%"><iframe src="https://embed.ted.com/talks/hans_rosling_religions_and_babies" width="854" height="480" style="position:absolute;left:0;top:0;width:100%;height:100%" frameborder="0" scrolling="no" allowfullscreen></iframe></div></div>

<br>
<br>

For this, we will need to gather the data:

- From [Gapminder](https://www.gapminder.org/data/), data per country and per year from 1800 to 2018:
    - [The children per woman total fertility](Data_Religion/children_per_woman_total_fertility.csv)
    - [The income per capita](Data_Religion/income_per_person_gdppercapita_ppp_inflation_adjusted.csv)
    - [The total population](Data_Religion/population_total.csv)
- From the [Pew Research Center](https://www.pewforum.org/2015/04/02/religious-projection-table/2010/percent/all/), data per country:
    + [The religious composition](Data_Religion/religion.csv)

------- 

# Data handling

The first thing to do is to load these datasets and combine them into a single one.

1. Load the `tidyverse` library and, using `read_csv()`, load the 4 datasets into 4 separate data frames called `children`, `income`, `pop` and `religion`.

```{r include=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
library(tidyverse)  # provides read_csv(), the dplyr verbs, pivot_longer() and the pipe
children <- read_csv("Data_Religion/children_per_woman_total_fertility.csv")
income   <- read_csv("Data_Religion/income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
pop      <- read_csv("Data_Religion/population_total.csv")
religion <- read_csv("Data_Religion/religion.csv")
```

2. To reproduce the chart in the video, we need to determine the dominant religion in each country. In the `religion` dataset, add a column `Religion` that gives the name of the dominant religion for each country. For this, you might want to use the following method, which returns the name of the column containing the maximum of each row of a `data.frame`:

```{r include=TRUE, warning = FALSE, message=FALSE, cache=FALSE}
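# Small example: three rows, three columns
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF
# max.col() gives, for each row, the index of the column holding the maximum
colnames(DF)[max.col(DF)]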
```

```{r include=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
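# Keep only the columns holding the share of each religion
DF <- religion %>% select(Buddhists:Unaffiliated)
# ties.method = "first" makes the choice reproducible when two shares are equal
religion <- religion %>% 
        mutate(Religion=colnames(DF)[max.col(DF, ties.method = "first")])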
```
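
One detail worth knowing about `max.col()`: when a row contains several equal maxima, it picks one of them at random by default, so the result can change from one run to the next. The sketch below uses a made-up table (the country names and percentages are invented for illustration and are not taken from the Pew data) and passes `ties.method = "first"` so that the leftmost maximum is always chosen.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  Country    = c("A", "B", "C"),
  Buddhists  = c(10, 45, 5),
  Christians = c(70, 45, 20),
  Muslims    = c(20, 10, 75)
)

# Keep only the numeric share columns
shares <- toy[, c("Buddhists", "Christians", "Muslims")]

# Country B has a tie (45 vs 45); "first" always returns the leftmost maximum
toy$Religion <- colnames(shares)[max.col(shares, ties.method = "first")]
toy$Religion
```

With the default `ties.method = "random"`, country B could come out as either of its two tied columns.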

3. Using `pivot_longer()`, make all datasets tidy. 

- `children` should now contain 3 columns: `Country`, `Year` and `Fertility`. 
- `income` should now contain 3 columns: `Country`, `Year` and `Income`. 
- `pop` should now contain 3 columns: `Country`, `Year` and `Population`. 

We will only consider data from 1800 to 2018. Example of syntax using the pipe operator `%>%`:

```{r}
DF <- read_table("name  2010  2011  2012  2014
Kevin  10    11   12   123
Jane   122   56   23   4
"
)
DF
DF %>% 
    select(name, '2010':'2012') %>%   # keep only the years of interest
    pivot_longer(cols=-name,          # pivot every column except name
                 names_to="Year", 
                 values_to="Score",
                 names_transform=list(Year = as.numeric))
```
The argument `names_transform=list(Year = as.numeric)` converts the year values, which come from the column names and are therefore character strings, to numeric values.

```{r include=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
# children is not restricted to 1800-2018 here: years outside this range
# are dropped later by the inner joins with income and pop
children <- children %>% 
                pivot_longer(cols=-Country, 
                             names_to="Year", 
                             values_to="Fertility",
                             names_transform=list(Year = as.numeric))
income   <- income %>% 
                select(Country, '1800':'2018') %>% 
                pivot_longer(cols=-Country, 
                             names_to="Year", 
                             values_to="Income",
                             names_transform=list(Year = as.numeric))
pop      <- pop %>% 
                select(Country, '1800':'2018') %>% 
                pivot_longer(cols=-Country, 
                             names_to="Year", 
                             values_to="Population",
                             names_transform=list(Year = as.numeric))
```
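
After pivoting, it can be worth checking that each table really has the expected shape before going further. The following is a minimal sketch on an invented wide table (the names `wide` and `long` and all values are hypothetical, not the Gapminder files); the same checks could be applied to `children`, `income` and `pop`.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy wide table, invented for illustration
wide <- tibble(
  Country = c("A", "B"),
  `1799`  = c(6.0, 5.0),
  `1800`  = c(6.1, 5.1),
  `1801`  = c(6.2, 5.2)
)

long <- wide %>%
  select(Country, `1800`:`1801`) %>%   # keep only the years of interest
  pivot_longer(cols = -Country,
               names_to = "Year",
               values_to = "Fertility",
               names_transform = list(Year = as.numeric))

# Quick checks: expected columns, numeric years, one row per country and year
stopifnot(identical(names(long), c("Country", "Year", "Fertility")))
stopifnot(is.numeric(long$Year), all(long$Year >= 1800))
stopifnot(nrow(long) == nrow(distinct(long, Country, Year)))
```

`stopifnot()` stays silent when every condition holds and stops with an error otherwise, which makes it a cheap guard at the top of an analysis.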

4. Now we want to combine all these datasets into a single one called `dat`, containing the columns `Country`, `Year`, `Population`, `Religion`, `Fertility` and `Income`. Look into the `inner_join()` function of the `dplyr` library (which is part of the `tidyverse` library). For the `religion` dataset, we will assume that the proportions of 2010 are representative of all years.

```{r include=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
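# State the join keys explicitly rather than relying on the shared column names
dat <- inner_join(children, income, by = c("Country", "Year")) %>%
        inner_join(pop, by = c("Country", "Year")) %>%
        inner_join(religion %>% 
                    filter(Year==2010) %>% 
                    select(Country, Religion),
                   by = "Country")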
```
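
Keep in mind that `inner_join()` only keeps rows that have a match in both tables, so a country spelled differently in two sources silently disappears from the result. A minimal sketch of how `anti_join()` can reveal such losses is given below; the tables `fert` and `rel` and their values are invented for illustration.

```r
library(dplyr)

# Toy tables, invented for illustration; note the two spellings of one country
fert <- tibble(Country   = c("France", "Czech Republic"),
               Year      = 2010,
               Fertility = c(2.0, 1.5))
rel  <- tibble(Country  = c("France", "Czechia"),
               Religion = c("Christians", "Unaffiliated"))

# inner_join() keeps only the countries present in both tables
joined <- inner_join(fert, rel, by = "Country")

# anti_join() lists the rows of the first table with no match in the second
dropped <- anti_join(fert, rel, by = "Country")
dropped$Country
```

Running the same `anti_join()` on the real tables before joining shows which countries would need to be renamed.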

You should end up with a dataset like this one:

```{r echo=FALSE, warning = FALSE, message=FALSE, cache=FALSE}
dat
```

If you struggled to get there, you can [download this data frame](Data_Religion/dat.csv) in order to continue.

Now that our dataset is ready, let's plot it.

# Plotting

1. Load the library `ggplot2` and set the global theme to `theme_bw()` using `theme_set()`.

```{r include=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
library(ggplot2)
theme_set(theme_bw())
```

2. Create a subset of `dat` for your country of origin. For me it will be `dat_france`.

```{r include=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
dat_france <- dat %>% filter(Country=="France")
```

3. Plot the evolution of the income per capita and of the number of children per woman as a function of the year, and make it look like this (notice the kinks during the two world wars):

```{r echo=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
ggplot(data=dat_france, aes(x=Year, y=Income))+
    ggtitle("Household income in France")+
    xlab("Year")+
    ylab("Household income per capita per year [constant $]")+
    annotate("rect", xmin=1914, xmax=1918, ymin=-Inf, ymax=Inf, alpha=.3)+
    annotate("rect", xmin=1939, xmax=1945, ymin=-Inf, ymax=Inf, alpha=.3)+
    geom_point(alpha=0.2, size=5)+
    geom_smooth()
```

```{r echo=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
ggplot(data=dat_france, aes(x=Year, y=Fertility))+
    ggtitle("Fertility in France")+
    xlab("Year")+
    lims(y=c(0,5))+
    ylab("Children per woman")+
    annotate("rect", xmin=1914, xmax=1918, ymin=-Inf, ymax=Inf, alpha=.3)+
    annotate("rect", xmin=1939, xmax=1945, ymin=-Inf, ymax=Inf, alpha=.3)+
    geom_line(size=2, color="red")
```

4. Create a subset of `dat` containing the data for your country plus all its neighboring countries (if you come from an island, the nearest countries...). For me, `dat_france_region` will contain data from France, Spain, Italy, Switzerland, Germany, Luxembourg and Belgium.

```{r include=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
dat_france_region <- dat%>%
        filter(Country %in% c("France", "Spain", "Italy", 
                             "Switzerland", "Germany", "Luxembourg", "Belgium"))
```

5. Plot income and fertility again as a function of the year, but map the color to the country and the point size to its population:

```{r echo=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
ggplot(data=dat_france_region, aes(x=Year, y=Income, col=Country, size=Population))+
    ggtitle("Household income in France and neighboring countries")+
    xlab("Year")+
    ylab("Household income per capita per year [constant $]")+
    annotate("rect", xmin=1914, xmax=1918, ymin=-Inf, ymax=Inf, alpha=.3)+
    annotate("rect", xmin=1939, xmax=1945, ymin=-Inf, ymax=Inf, alpha=.3)+
    geom_point(alpha=0.5)
```

```{r echo=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
ggplot(data=dat_france_region, aes(x=Year, y=Fertility, col=Country, size=Population))+
    ggtitle("Fertility in France and neighboring countries")+
    xlab("Year")+
    lims(y=c(0,5))+
    ylab("Children per woman")+
    annotate("rect", xmin=1914, xmax=1918, ymin=-Inf, ymax=Inf, alpha=.3)+
    annotate("rect", xmin=1939, xmax=1945, ymin=-Inf, ymax=Inf, alpha=.3)+
    geom_point(alpha=.5)
```

6. Load the library `plotly` and make the previous graphs interactive. You can make an interactive graph by calling `ggplotly()`, like this:

```{r include=TRUE, warning = FALSE, message=FALSE, cache=FALSE}
library(plotly)
P <- ggplot(data = dat_france, aes(x=Population, y=Income))+
        geom_point()
ggplotly(P) # adding dynamicTicks=TRUE redraws the ticks when zooming in
``` 
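
As a complement, here is a small sketch of two `ggplotly()` arguments that are often useful: `tooltip`, which selects the aesthetics shown when hovering, and `dynamicTicks`, mentioned in the comment above. The data frame `toy` and its values are invented for illustration, and the `text` aesthetic is used here only to carry the label displayed in the tooltip.

```r
library(ggplot2)
library(plotly)

# Toy data, invented for illustration
toy <- data.frame(
  Country   = c("A", "B", "C"),
  Income    = c(1000, 5000, 20000),
  Fertility = c(6, 3, 1.5)
)

# The text aesthetic is not drawn by geom_point(); it only feeds the tooltip
p <- ggplot(toy, aes(x = Income, y = Fertility, text = Country)) +
  geom_point()

# tooltip chooses what appears on hover; dynamicTicks redraws ticks on zoom
ip <- ggplotly(p, tooltip = c("text", "x", "y"), dynamicTicks = TRUE)

# Evaluate `ip` at the console, or as the last line of a chunk, to display it
```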

7. Finally, you can add a slider to the interactive graph to select the value of another variable (just like in the video) by adding the keyword `frame =` to the chart's aesthetics. So now, make the graph from the video! (You can also add the aesthetic `id=Country` to show the country name in the popup when hovering over a point.)


```{r include=params$solution, warning = FALSE, message=FALSE, cache=FALSE}
library(plotly)
P <- dat %>% filter(Year>=1900) %>% 
    ggplot(aes(x     = Income, 
               y     = Fertility, 
               frame = Year, 
               col   = Religion, 
               size  = Population,
               id    = Country))+
        geom_point(alpha=0.5)+
        ggtitle("Fertility vs. Income in the World")+
        xlab("Household income per capita per year [constant $]")+
        lims(y=c(0,8))+
        scale_x_log10()+
        scale_size(range=c(1, 15))+
        ylab("Children per woman")
ggplotly(P)
```
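
The speed of the animation and the label of the slider can be adjusted with the `plotly` helpers `animation_opts()` and `animation_slider()`. The sketch below shows one possible setup on an invented data frame (`toy` and its values are hypothetical); the durations are arbitrary choices, given in milliseconds.

```r
library(ggplot2)
library(plotly)

# Toy data, invented for illustration: two countries over three years
toy <- data.frame(
  Country   = rep(c("A", "B"), each = 3),
  Year      = rep(2000:2002, times = 2),
  Income    = c(1000, 1200, 1500, 8000, 8500, 9000),
  Fertility = c(5.0, 4.8, 4.5, 2.0, 1.9, 1.8)
)

p <- ggplot(toy, aes(x = Income, y = Fertility, frame = Year, col = Country)) +
  geom_point() +
  scale_x_log10()

anim <- ggplotly(p) %>%
  # each frame lasts 500 ms, of which 300 ms are spent on the transition
  animation_opts(frame = 500, transition = 300, easing = "linear") %>%
  # prefix shown in front of the current value of the slider
  animation_slider(currentvalue = list(prefix = "Year: "))

# Evaluate `anim` at the console, or as the last line of a chunk, to display it
```

The same two calls can be piped after `ggplotly(P)` in the solution above.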
